Tittle: Identification of Marker Genes in Human Breast Tumors using scRNA¶
Introduction¶
Tamoxifen is a selective estrogen receptor modulator (SERM) medication used to treat breast cancer. breast cancer is a major type of cancer in females. Detecting cancer early poses a significant challenge due to its inherent heterogeneity. In ER-positive, HER2-negative breast cancer, multiple measures of intra-tumor heterogeneity are associated with worse response to endocrine therapy
Recent progress in next-generation technologies has facilitated the exploration of novel gene expressions involved in breast cancer pathogenesis. The utilization of Next-Generation Sequencing (NGS) methods has generated extensive datasets, necessitating computationally intensive analyses. Consequently, machine learning models have been deployed to effectively analyze these large datasets. Machine learning algorithms have demonstrated their effectiveness and computational efficiency in identifying disease biomarkers from high-dimensional datasets.
In this project, we leverage a single-cell-RNA dataset combined with machine learning techniques to identify and determine breast cancer biomarkers. To identify gene signatures associated with breast cancer for six different types of breast cancer tumors, we will use Scanpy.
Research Questions¶
- Which genes are differentially expressed in breat cancer?
- Which genes are co-expressed in breast cancer?
- What is the expression pattern of differentially expressed genes per cell type?
- Which pathways are enriched in different cell types in breast cancer?
Dataset¶
The dataset under consideration comprises samples from the population of live human tumors and normal breast specimens tha were taken immediately after surgical resection for processing into single-cell workflows for experimentation and genomic analyses.I this study, we show differences in tamoxifen response by cell type. The data set was obtained from NCBI GEO database under the accession number GSE245601
Methodology¶
- Data Acqusition: h5 format
- Preprocessing and Quality control: Gene, cell and mitochondria filtering
- Clustering: UMAP
- Data Visualization: Dot plots,violin, track plots,scatter plots, heatmaps
- Benchmarking: Literature Search
# ignore warnings and standasrd outputs
# Filtering warnings from current version of matplotlib
import warnings
warnings.filterwarnings(
"ignore", message=".*Parameters 'cmap' will be ignored.*", category=UserWarning
)
warnings.filterwarnings(
"ignore", message="Tight layout not applied.*", category=UserWarning
)
Workflow¶
from IPython.display import Image, display
# Replace 'path_to_your_image.jpg' with the actual path to your image file
image_path = 'new_workflow.png'
# Display the image
display(Image(filename=image_path))
Data Acquisition¶
The datset was downloaded using ftp commandline as follows Counts data
!wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE118nnn/GSE118389/suppl/GSE118389%5Fnorm%5Fdata%2Etxt%2Egz
The counts data represent the number of RNA sequencing reads that have been aligned to each gene in the transcriptome. These counts can be indicative of the level of gene expression in the samples. The file contains gene expression data in the form of counts, derived from RNA sequencing experiments
Normalized counts data ! wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE118nnn/GSE118389/suppl/GSE118389%5Fnorm%5Fdata%2Etxt%2Egz
This file likely contains normalized gene expression data from the same RNA sequencing experiments. Normalization is a process used to adjust for technical variations between samples, allowing for more accurate comparisons of gene expression levels across different experimental conditions or samples.
Scanpy set up and loading required libraries¶
import pandas as pd
import scanpy as sc
import os
import numpy as np
import scipy
import scipy.io as sio
import pandas as pd
import warnings
import decoupler
import seaborn.objects as so
warnings.simplefilter(action='ignore', category=FutureWarning)# supress some warnings
sc.settings.verbosity = 3
Data integration¶
The data was in multiple .h5 files. We used a function to merge the different files into a single file
import h5py
import os
# output_file = "output.h5"
folder_path = "../data/"
# output_file = "../data/output.h5"
# Get a list of H5 files in the folder
adatas = []
# d_names = ["../data/"+f for f in os.listdir(folder_path) if f.endswith('.h5')]
d_names = [
'GSM7845546_Normal_01_Control.h5',
'GSM7845548_Normal_02_Control.h5',
# 'GSM7845550_T47D_Control.h5',
'GSM7845552_Tumor_01_Control.h5',
'GSM7845554_Tumor_02_Control.h5',
'GSM7845556_Tumor_03_Control.h5',
'GSM7845558_Tumor_04_Control.h5',
# 'GSM7845560_Tumor_05_Control.h5',
# 'GSM7845562_Tumor_06_Control.h5',
# 'GSM7845564_Tumor_07_Control.h5',
'GSM7845566_Tumor_08_Control.h5',
# 'GSM7845568_Tumor_09_Control.h5',
# 'GSM7845570_Tumor_10_Control.h5'
]
for filename in d_names:
adata = sc.read_10x_h5('../data/'+filename)
adata.var_names_make_unique()
adatas.append(adata)
adata = adatas[0].concatenate(adatas[1:])
# Save the concatenated AnnData object to an HDF5 file
output_h5_file = "../data/merged.h5" # Adjust the output file name
adata.write(output_h5_file)
reading ../data/GSM7845546_Normal_01_Control.h5 (0:00:02)
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
reading ../data/GSM7845548_Normal_02_Control.h5
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
(0:00:02)
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
reading ../data/GSM7845552_Tumor_01_Control.h5
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
(0:00:02)
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
reading ../data/GSM7845554_Tumor_02_Control.h5
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
(0:00:01)
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
reading ../data/GSM7845556_Tumor_03_Control.h5
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
(0:00:02)
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
reading ../data/GSM7845558_Tumor_04_Control.h5
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
(0:00:00) reading ../data/GSM7845566_Tumor_08_Control.h5
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
(0:00:01)
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
data= sc.read_h5ad("../data/merged.h5")
data
AnnData object with n_obs × n_vars = 39591 × 33538
obs: 'batch'
var: 'gene_ids', 'feature_types', 'genome'
data.obs
| batch | |
|---|---|
| AAACCCAAGTCAGGGT-1-0 | 0 |
| AAACCCACATCACAGT-1-0 | 0 |
| AAACCCATCTCAATCT-1-0 | 0 |
| AAACGAACAAGTTCGT-1-0 | 0 |
| AAACGAACACAACGAG-1-0 | 0 |
| ... | ... |
| TTTGATCTCTAGACAC-1-6 | 6 |
| TTTGGTTGTTCCTAAG-1-6 | 6 |
| TTTGTTGAGGGCATGT-1-6 | 6 |
| TTTGTTGAGGTAATCA-1-6 | 6 |
| TTTGTTGGTCCGGACT-1-6 | 6 |
39591 rows × 1 columns
# # Get a list of H5 files in the folder in a specific order
# # You can change the order as needed
# d_names = [
# 'GSM7845546_Normal_01_Control.h5',
# 'GSM7845548_Normal_02_Control.h5',
# 'GSM7845550_T47D_Control.h5',
# 'GSM7845552_Tumor_01_Control.h5',
# 'GSM7845554_Tumor_02_Control.h5',
# 'GSM7845556_Tumor_03_Control.h5',
# 'GSM7845558_Tumor_04_Control.h5',
# 'GSM7845560_Tumor_05_Control.h5',
# 'GSM7845562_Tumor_06_Control.h5',
# 'GSM7845564_Tumor_07_Control.h5',
# 'GSM7845566_Tumor_08_Control.h5',
# 'GSM7845568_Tumor_09_Control.h5',
# 'GSM7845570_Tumor_10_Control.h5'
# ]
# Initialize a list to store labels based on the file names
labels = []
# Dictionary to store the number of observations in each file
observations_count = {}
# Iterate over H5 files and get the number of observations
for i in d_names:
path = os.path.join(folder_path, i)
# # Read AnnData object using Scanpy
adata = sc.read_10x_h5(path)
# # Get the number of observations
num_observations = adata.n_obs
# print(num_observations)
observations_count[i] = num_observations
# # Determine label based on file name
label = "Normal" if "Normal" in i else "Tumor"
# # Repeat the label for the corresponding number of observations
labels.extend([label] * num_observations)
reading ../data/GSM7845546_Normal_01_Control.h5 (0:00:01)
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
reading ../data/GSM7845548_Normal_02_Control.h5
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
(0:00:02)
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
reading ../data/GSM7845552_Tumor_01_Control.h5
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
(0:00:02)
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
reading ../data/GSM7845554_Tumor_02_Control.h5
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
(0:00:01)
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
reading ../data/GSM7845556_Tumor_03_Control.h5
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
(0:00:01)
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
reading ../data/GSM7845558_Tumor_04_Control.h5
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
(0:00:00) reading ../data/GSM7845566_Tumor_08_Control.h5
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
(0:00:00)
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
/usr/local/lib/python3.10/dist-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
observations_count
{'GSM7845546_Normal_01_Control.h5': 4267,
'GSM7845548_Normal_02_Control.h5': 12087,
'GSM7845552_Tumor_01_Control.h5': 7463,
'GSM7845554_Tumor_02_Control.h5': 5901,
'GSM7845556_Tumor_03_Control.h5': 4820,
'GSM7845558_Tumor_04_Control.h5': 1994,
'GSM7845566_Tumor_08_Control.h5': 3059}
len(labels)
39591
# Add labels to the concatenated AnnData object
data.obs['label'] = labels
data.obs
| batch | label | |
|---|---|---|
| AAACCCAAGTCAGGGT-1-0 | 0 | Normal |
| AAACCCACATCACAGT-1-0 | 0 | Normal |
| AAACCCATCTCAATCT-1-0 | 0 | Normal |
| AAACGAACAAGTTCGT-1-0 | 0 | Normal |
| AAACGAACACAACGAG-1-0 | 0 | Normal |
| ... | ... | ... |
| TTTGATCTCTAGACAC-1-6 | 6 | Tumor |
| TTTGGTTGTTCCTAAG-1-6 | 6 | Tumor |
| TTTGTTGAGGGCATGT-1-6 | 6 | Tumor |
| TTTGTTGAGGTAATCA-1-6 | 6 | Tumor |
| TTTGTTGGTCCGGACT-1-6 | 6 | Tumor |
39591 rows × 2 columns
Data Wrangling/ Exploratory Data Analysis¶
In this section, we will answer below questions
- How many genes are in the raw count dataset?
- How many cells?
- Are there mitochondria contaminants?
- Which genes are highly expressed?
Number of cells (observations or n_obs): 4267
Number of genes (variables or n_vars): 33538
# data
#check total number of mitochondrial genes in the dataset
data.var_names.str.contains("MT-").sum()
14
# annotate the group of mitochondrial genes as 'MT'
data.var['MT'] = data.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(data, qc_vars=['MT'], percent_top=None, log1p=False, inplace=True)
# check the genes and count statistics
data.var
| gene_ids | feature_types | genome | MT | n_cells_by_counts | mean_counts | pct_dropout_by_counts | total_counts | |
|---|---|---|---|---|---|---|---|---|
| MIR1302-2HG | ENSG00000243485 | Gene Expression | GRCh38 | False | 1 | 0.000025 | 99.997474 | 1.0 |
| FAM138A | ENSG00000237613 | Gene Expression | GRCh38 | False | 0 | 0.000000 | 100.000000 | 0.0 |
| OR4F5 | ENSG00000186092 | Gene Expression | GRCh38 | False | 6 | 0.000152 | 99.984845 | 6.0 |
| AL627309.1 | ENSG00000238009 | Gene Expression | GRCh38 | False | 112 | 0.002829 | 99.717107 | 112.0 |
| AL627309.3 | ENSG00000239945 | Gene Expression | GRCh38 | False | 18 | 0.000455 | 99.954535 | 18.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AC233755.2 | ENSG00000277856 | Gene Expression | GRCh38 | False | 5 | 0.000126 | 99.987371 | 5.0 |
| AC233755.1 | ENSG00000275063 | Gene Expression | GRCh38 | False | 2 | 0.000126 | 99.994948 | 5.0 |
| AC240274.1 | ENSG00000271254 | Gene Expression | GRCh38 | False | 1693 | 0.056907 | 95.723776 | 2253.0 |
| AC213203.1 | ENSG00000277475 | Gene Expression | GRCh38 | False | 0 | 0.000000 | 100.000000 | 0.0 |
| FAM231C | ENSG00000268674 | Gene Expression | GRCh38 | False | 0 | 0.000000 | 100.000000 | 0.0 |
33538 rows × 8 columns
#make the gene names unique
data.var_names_make_unique()
Quality Control and Filtering¶
- Mitochondrial gene composition. Mitochondrial genes often start with "MT-" or "MTRNR," and their inclusion helps estimate the mitochondrial composition of each cell. Most of the cells have a mitochondria count percentage below 15%. The cells with higher than 15% mitochondria count may signify cell stress and damage and were therefore filterd out.
- Cell counts. Cells with less than 500 genes were excluded from downstream analysis. The low gene count in cells can be due to low capture, inefficient cell lysis or non-viable cells. Each violin plot represents the distribution of the number of genes expressed per cell.
- Gene counts
- Cells with less than 20000 total count were excluded from the analysis.
Violin Plot¶
sc.pl.violin(
data,
[
'n_genes_by_counts',
'total_counts',
'pct_counts_MT'
],
multi_panel=True,
# save='_gene_MT_vs_transcript_counts',
);
Scatter Plots¶
The plots indicate that some reads have a relatively high percentage of mitochondrial counts which are often associated with cell degradation. However, most of the cells have less than 15% mitochondria percentage which. We can also observe most cells have total counts of below 20000. For gene counts per cell, most cells approximately contain about 5000 gene counts on average. These visualizations were used to set the quality control thresholds. We also filter out genes that are not detected in at least 20 cells as these are not informative.
sc.pl.scatter(data, "total_counts", "n_genes_by_counts")
The scatter plot shows gene counts per each gene. The output shows presence of potential outliers for gene counts
sc.pl.scatter(data, "total_counts", "pct_counts_MT")
From the scatter plot, few cells have mitochondria composition above 15% which is the desirable threshold. We will filter cells with a percentage count above 15% mitochondria count
# Filter the data
data = data[data.obs.n_genes_by_counts < 5000,:]
data = data[data.obs.total_counts < 20000,:]
data = data[data.obs.pct_counts_MT < 15,:]
# Min 20 cells - filters out 0 count genes
sc.pp.filter_genes(data, min_cells=20)
data.shape # Checking number of cells remaining
filtered out 13761 genes that are detected in less than 20 cells
/usr/local/lib/python3.10/dist-packages/scanpy/preprocessing/_simple.py:250: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual. adata.var['n_cells'] = number
(33434, 19777)
We have filtered cells with less than 500 genes, cells with less than 20000 total count and cells with more than 15% mitochondrial gene composition and genes that are only present in 20 cells or less.
Violin plots after QC¶
After filtering, the number of genes remaining is 2972 and the number of cells remainaing is 33538
sc.pl.violin(
data,
[
'n_genes_by_counts',
'total_counts',
'pct_counts_MT'
],
jitter=1,
multi_panel=True,
# save='Filtered_gene_MT_vs_transcript_counts'
)
After the quality control, we can see the interquartile ranges for gene counts, total counts and mitochondria composition are within the expected ranges based on the previous filtered parameters.
Scatter plot after QC¶
sc.pl.scatter(data, x='total_counts', y='n_genes_by_counts')
The scatter plot shows the distribution of the number of gene counts. We can see the gene counts increase with the increase in total counts for the genes. The plot also shows the presence of potential outliers in some gene counts. Before filtering, we can normalize the counts and visualize nwe the distribution.
Normalization¶
This step is performed to account for technical biases and variability across cells, ensuring that the observed differences in gene expression are reflective of biological signals rather than technical artifacts. For example
- Each cell in a scRNA-seq experiment may have a different number of RNA molecules, resulting in variability in library sizes
- samples are processed at different times, and batch effects can introduce unwanted variability.
- Longer genes generally produce more sequencing reads, and some cells may be sequenced more deeply than others.
- Technical variations such as amplification bias, and sequencing biases
In this project, transcripts per million (TPm) normalization and log normalization have been used. The log normalization was adopted since scRNA has high variance.
#normarization
data.layers["counts"] = data.X.copy() # preserve counts
sc.pp.normalize_total(data, target_sum=1e6) # scale each cell to a common library size
sc.pp.log1p(data) # log(expression + 1)
data.raw = data # freeze the state in `.raw`
normalizing counts per cell
finished (0:00:00)
Gene counts after QC and Normalization¶
print("min_gene count is ", data.obs["n_genes_by_counts"].min(), "\n max_gene count is ",data.obs["n_genes_by_counts"].max())
min_gene count is 52 max_gene count is 4997
data.obs["n_genes_by_counts"].max()
4997
Highly variable genes¶
Discarding non-significant genes that are equally expressed across the cells is a preprocessing step in scRNA-seq analysis. This process is performed to reduce noise, to focus on the most informative genes, and improve the efficiency of downstream analyses. After extracting highly variable genes, we see that only 6684 genes remained for downstream analysis.
#determining High variable genes
sc.pp.highly_variable_genes(data, min_mean=0.0125, max_mean=3, min_disp=0.25)
sc.pl.highly_variable_genes(data)
extracting highly variable genes
finished (0:00:03)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
The plots show that before normalization, the dispersion of genes have wide range from 3.2 to 3.9. Aftwr bnormalization, there is positive and negative dispersion ranging from -2 to 2. Genes with a mean expression below 0.25 dispersion threshold were excluded from consideration as highly variable genes
data.raw=data
data = data[:, data.var['highly_variable']]
sc.pp.scale(data, zero_center=True, max_value=3)
/usr/local/lib/python3.10/dist-packages/scanpy/preprocessing/_simple.py:843: UserWarning: Received a view of an AnnData. Making a copy. view_to_actual(adata)
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption
print("min_nom",data.var["dispersions_norm"].min(),"\nmax_norm is",data.var["dispersions_norm"].max())
min_nom 0.25004742 max_norm is 15.131056
sc.pl.highest_expr_genes(data, n_top=20)
normalizing counts per cell
/usr/local/lib/python3.10/dist-packages/scanpy/preprocessing/_normalization.py:196: UserWarning: Some cells have zero counts
warn(UserWarning('Some cells have zero counts'))
finished (0:00:00)
From the plot, the gene which is most variable is MBD6
top_1500Genes=sc.pp.highly_variable_genes(data, flavor="seurat", n_top_genes=1500)
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
finished (0:00:03)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
/usr/local/lib/python3.10/dist-packages/scanpy/preprocessing/_highly_variable_genes.py:212: RuntimeWarning: invalid value encountered in log dispersion = np.log(dispersion)
Dimentionality reduction (PCA)¶
From the plot, among highest variable gene is ABCC9
# # Perform PCA
# sc.tl.pca(data)
# # Visualize PCA results
# sc.pl.pca(data, color='leiden')
# Perform PCA
sc.tl.pca(data)
# Visualize PCA loadings
sc.pl.pca_loadings(data, show=True) #
computing PCA
on highly variable genes
with n_comps=50
finished (0:00:29)
From the PCA results, we can see that gene KRT80 is the most differentially expressed in the cells for PC1. Other genes such as ABCA10 contribute to almost no variation in the expression patterns for PCA1. In PC2, the LOXL2 gene contributes most to the observed gene expression patterns in the cells. Some genes such as STXBP2 seem to be very lowly expressed in some cells and it contributes to the variation in gene expression patterns for PC2. For PC3, the gene NUAK1 is upregualted in some cells causing the variation in the expression patterns whereas gene ELP6 seems to be underexpressed among the cell types. These genes can therefore be potential biomarkers for breast cancer disease. The number of PCA to use varies with the sample and optimal PCA should be determined for optimal variation in the samples. We can extract the contribution of each PCA and accumulated variance and pick the optimal at the elbow position of th plot as shown in the next figure. The optimal PCAs are 9.
Perone, Y., Farrugia, A. J., Rodríguez-Meira, A., Győrffy, B., Ion, C., Uggetti, A., ... & Magnani, L. (2019). SREBP1 drives Keratin-80-dependent cytoskeletal changes and invasive behavior in endocrine-resistant ERα breast cancer. Nature communications, 10(1), 2115. (KRT80)
Mamoor, S. (2023). ABCA10 is a differentially expressed gene in brain metastatic human breast cancer. (ABCA10)
Ahn, S. G., Dong, S. M., Oshima, A., Kim, W. H., Lee, H. M., Lee, S. A., ... & Green, J. E. (2013). LOXL2 expression is associated with invasiveness and negatively influences survival in breast cancer patients. Breast cancer research and treatment, 141, 89-99. (LOXL2)
Kuehn, J., Espinoza‐Sanchez, N. A., Teixeira, F. C., Pavão, M. S., Kiesel, L., Győrffy, B., ... & Götte, M. (2021). Prognostic significance of hedgehog signaling network‐related gene expression in breast cancer patients. Journal of Cellular Biochemistry, 122(5), 577-597. (ELP6)
Orlandella, F. M., Mariniello, R. M., Mirabelli, P., De Stefano, A. E., Iervolino, P. L. C., Lasorsa, V. A., ... & Salvatore, G. (2020). miR-622 is a novel potential biomarker of breast carcinoma and impairs motility of breast cancer cells through targeting NUAK1 kinase. British journal of cancer, 123(3), 426-437. (NUAK1)
# Visualize PCA variance ratio
sc.pl.pca_variance_ratio(data, n_pcs=20, log=True, show=True)
Clustering¶
We will use UMAP and tSNE clustering and 8 PCAs already determined
We will pick number of components to be 8 as this is the optimal PCS from the output of the PCs tested. Only PC1 has positive positive loadings indicating higher expression levels of those genes contribute to higher values of that PC. The remaininng PCs suggest that lower expression levels of those genes contribute to higher values of that PC.
sc.pp.neighbors(data,n_pcs=8)
sc.tl.umap(data)
sc.tl.leiden(data, key_added="clusters")
computing neighbors
using 'X_pca' with n_pcs = 8
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:01:15)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:59)
running Leiden clustering
finished: found 18 clusters and added
'clusters', the cluster labels (adata.obs, categorical) (0:01:26)
sc.tl.leiden(data)
running Leiden clustering
finished: found 18 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:01:34)
Embedding¶
The leiden algorithm groups cells based on the closeness of their expression patterns. We will use tSNA,and UMAP
# Embeddings'
# Perform t-SNE
sc.tl.tsne(data,n_pcs=8)
## tSNE
sc.pl.tsne(data, color='leiden')
computing tSNE
using 'X_pca' with n_pcs = 8
using sklearn.manifold.TSNE
finished: added
'X_tsne', tSNE coordinates (adata.obsm) (0:04:10)
/usr/local/lib/python3.10/dist-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored cax = scatter(
We can see 19 distinct clusets for the cell types after tSNE. t-SNE is a nonlinear dimensionality reduction technique commonly used for visualizing high-dimensional data in two or three dimensions. It emphasizes the preservation of local structures, making it well-suited for revealing clusters or groups of cells with similar gene expression profiles.
# # Compute a custom embedding (e.g., UMAP)
# sc.tl.umap(data,n_components=9)
# # Visualize cells in the UMAP space
# sc.pl.embedding(data, basis='umap', color='leiden')
# import numpy as np
# import seaborn as sns
# np.random.seed(42)
# data.obs['condition'] = np.random.choice(['healthy', 'tumor'], size=len(data))
# # Compute UMAP embedding
# sc.tl.umap(data)
# # Visualize the density of cells in the UMAP embedding per condition
# sc.pl.embedding_density(data, basis='umap', key='condition', cmap=sns.color_palette("Set2", as_cmap=True))
# plt.show()
data.obs
| batch | label | n_genes_by_counts | total_counts | total_counts_MT | pct_counts_MT | clusters | leiden | |
|---|---|---|---|---|---|---|---|---|
| AAACCCAAGTCAGGGT-1-0 | 0 | Normal | 1036 | 3271.0 | 227.0 | 6.939774 | 8 | 8 |
| AAACCCACATCACAGT-1-0 | 0 | Normal | 1720 | 6521.0 | 70.0 | 1.073455 | 8 | 8 |
| AAACCCATCTCAATCT-1-0 | 0 | Normal | 1190 | 3436.0 | 299.0 | 8.701980 | 16 | 16 |
| AAACGAACAAGTTCGT-1-0 | 0 | Normal | 2464 | 7967.0 | 91.0 | 1.142212 | 8 | 8 |
| AAACGAATCTACCCAC-1-0 | 0 | Normal | 2851 | 9582.0 | 67.0 | 0.699228 | 8 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTGATCTCCCAGCGA-1-6 | 6 | Tumor | 1416 | 3035.0 | 43.0 | 1.416804 | 12 | 12 |
| TTTGATCTCTAGACAC-1-6 | 6 | Tumor | 483 | 845.0 | 26.0 | 3.076923 | 1 | 1 |
| TTTGTTGAGGGCATGT-1-6 | 6 | Tumor | 2577 | 6001.0 | 484.0 | 8.065323 | 9 | 9 |
| TTTGTTGAGGTAATCA-1-6 | 6 | Tumor | 377 | 522.0 | 25.0 | 4.789272 | 1 | 1 |
| TTTGTTGGTCCGGACT-1-6 | 6 | Tumor | 405 | 1072.0 | 20.0 | 1.865672 | 7 | 7 |
33434 rows × 8 columns
Clustering of cell types¶
Leiden algorithm was be used for hierarchical clustering, allowing for to discover and characterize cell types or states within a heterogeneous sample. We will compare tSNE and UMAP clustering to see which one provides more reasonale clusteers. In each case we will use PCAs of 9.
UMAP clustering¶
UMAP was used as an effective preprocessing step to boost the performance of density based clustering.
sc.pp.neighbors(data,n_pcs=8)
sc.tl.umap(data,n_components=8)
sc.tl.leiden(data, key_added="clusters")
computing neighbors
using 'X_pca' with n_pcs = 8
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:07)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:01:03)
running Leiden clustering
finished: found 18 clusters and added
'clusters', the cluster labels (adata.obs, categorical) (0:01:16)
sc.tl.leiden(data)
running Leiden clustering
finished: found 18 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:01:16)
sc.pl.umap(data, color=['leiden'], legend_fontsize=8)
/usr/local/lib/python3.10/dist-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored cax = scatter(
Ranking genes per cell type¶
# Extracting top 1500 genes
top_1500Genes=sc.pp.highly_variable_genes(data, flavor="seurat", n_top_genes=1500)
top_1500Genes
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
finished (0:00:03)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
/usr/local/lib/python3.10/dist-packages/scanpy/preprocessing/_highly_variable_genes.py:212: RuntimeWarning: invalid value encountered in log dispersion = np.log(dispersion)
We can again see 19 clusters from UMAP clustering. However, the tSNE seems to be giving cleaer distictness among the clusters. In the next steps, we are going to unpack each category to determine which genes are differentially expressed in the cell categories.
import scanpy as sc
import pandas as pd
from matplotlib import rcParams
sc.tl.rank_genes_groups(data, 'leiden', method='t-test')
# The head function returns the top n genes per cluster
top_markers = pd.DataFrame(data.uns['rank_genes_groups']['names']).head(5)
# print(top_markers)
top_markers
ranking genes
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:18)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PIP | B2M | ANGPT2 | PTPRC | KRT7 | PIP | KRT7 | KRT18 | GEM | KRT8 | KRT19 | COL6A2 | C11orf96 | SRGN | SRGN | CXCR4 | TCF4 | CD3D |
| 1 | KRT7 | RPS27 | CXCL8 | B2M | MGST1 | KRT7 | KRT19 | KRT8 | PPP1R15A | KRT18 | KRT18 | LGALS1 | COL6A2 | TPSAB1 | FTL | CYBA | IGFBP7 | PTPRC |
| 2 | CXCL8 | TMSB4X | TM4SF1 | SARAF | TM4SF1 | KRT19 | PIP | ELF3 | ANXA5 | SPINT2 | PLCG2 | DCN | IGFBP7 | HPGDS | TYROBP | HLA-DRA | GNG11 | CD3G |
| 3 | SOD2 | PTPRC | PIP | CD3D | WFDC2 | FDCSP | WFDC2 | KRT19 | HNRNPH3 | JPT1 | KRT8 | LUM | LUM | FCER1G | VIM | CD79A | IFI27 | HCLS1 |
| 4 | MGP | CXCR4 | HMOX1 | CYBA | CD24 | CXCL8 | TM4SF1 | SPINT2 | LGALS1 | TNFRSF12A | SAT1 | GEM | DCN | TPSB2 | CTSL | B2M | ID3 | LCP1 |
Top Marker genes by cluster¶
The top gene per category of cells have been extracted for spatial analysis i.e to get the patterns of tgeir distribution in other cells. From the results of the marker gene extraction after log nomalization and quakity control, CTSL gene is the most variable gene across the cell types.
steps¶
- Obtain highly variable genes
- Rank the genes in order of counts
- Cluster the genes into distinct groups
- Rank the genes in each cluster
top_genes_list = []
# Iterate over the columns (categories) of the DataFrame
for col in top_markers.columns:
# Extract the top genes for the current category and convert them to a list
top_genes_sublist = top_markers[col].tolist()[0]
# Append the sublist of top genes to the main list
top_genes_list.append(top_genes_sublist)
top_genes_list.insert(13,"KRT8") # check confirmation from literature of co-experession between KRT8 and KRT18
# Print the list of lists
top_genes_list
['PIP', 'B2M', 'ANGPT2', 'PTPRC', 'KRT7', 'PIP', 'KRT7', 'KRT18', 'GEM', 'KRT8', 'KRT19', 'COL6A2', 'C11orf96', 'KRT8', 'SRGN', 'SRGN', 'CXCR4', 'TCF4', 'CD3D']
marker_genes = list(set(top_genes_list))
# Marker genes per cluster
ax = sc.pl.stacked_violin(data, marker_genes, groupby='leiden',
var_group_positions=[(7, 8)], var_group_labels=['NK'])
Genes such as CTSL are expressed in more than one type of cells in clusters 12, 10, 1 and 0. We can visualize the gene clustering using a dot plot with a dendrogram to show the relatedness and level of the gene expression across different cells as shown below. Track plot also shows the gene expression profiles using baar graphs as shown below.
ax = sc.pl.dotplot(data, marker_genes, groupby='leiden', dendrogram=True, dot_max=0.5, dot_min=0.3, standard_scale='var')
WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_leiden']`
/usr/local/lib/python3.10/dist-packages/scanpy/plotting/_dotplot.py:747: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored dot_ax.scatter(x, y, **kwds)
From the dot plot and using the top selected genes per cluster, expression patterns in clusters 8 and 9 are closer than other expression patterns. Also, classes 2 and 10 are closely related in their expression patterns. Among the cell types, CTSL is the highest expressed in all the six cell types. AREG gene is highly expressed in two cell types for clusters 9 and 5. The output also suggests co-expression patterns for KRT8 and KRT18. This is evident by the fact that clusters that show differential expression for the AREG gene also have differential expression of the KRT18 gene. Some genes are also shown to be interrelated and may be involved in similar metabolic pathways. For example. In cluster 11, CTSL, SRGN, and HLA-DQA1 are highly expressed and their overexpressed seems to cause the underexpression of AREG, TNFA1P6, and CXCR4.
Cell types based on marker genes¶
adata.obs.rename({"label": "condition"}, axis=1, inplace=True)
adata.obs.cell_type.value_counts()
# adata.obs["condition"].replace({"ctrl": "control", "stim": "stimulated"}, inplace=True)
We can use the marker gene to identify the cell types. These marker genes were obtained from the literature.
marker_genes_dict = {'B-cell': ['CD79A', 'MS4A1'],
'T-cell': 'CD3D',
'T-cell CD8+': ['CD8A', 'CD8B'],
'NK': ['GNLY', 'NKG7'],
'Myeloid': ['CST3', 'LYZ'],
# 'Monocytes': ['FCGR3A'],
# 'Dendritic': ['FCER1A']
}
ax = sc.pl.dotplot(data, marker_genes_dict, groupby='leiden', dendrogram=True,
standard_scale='var', smallest_dot=40, color_map='Blues', figsize=(8,5))
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different. categories: 0, 1, 2, etc. var_group_labels: B-cell, T-cell, T-cell CD8+, etc.
/usr/local/lib/python3.10/dist-packages/scanpy/plotting/_dotplot.py:747: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored dot_ax.scatter(x, y, **kwds)
Track plots¶
Track plots is another way of visualizing expression patterns of genes per cell type
# Track plot data is better visualized using the non-log counts
import numpy as np
ad = data.copy()
ad.raw.X.data = np.exp(ad.raw.X.data)
ax = sc.pl.tracksplot(ad,marker_genes, groupby='leiden')
We can see similar expression patterns in clusster 6 and 7 . These clusters show similar density pattens
Marker gene Visualization¶
#v one vs rest visualization
rcParams['figure.figsize'] = 4,4
rcParams['axes.grid'] = True
sc.pl.rank_genes_groups(data)
Among the top cell-specific expressed genes are 'CTSL', 'GEM', 'PLCG2', 'DCN', 'AREG', 'SPINT2', 'PLCG2', 'KRT7', 'KRT18', 'HLA-C', 'GSTO1', 'IGFBP7', 'RPS27', 'KRT8'
sc.pl.rank_genes_groups_dotplot(data, n_genes=5)
/usr/local/lib/python3.10/dist-packages/scanpy/plotting/_dotplot.py:747: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored dot_ax.scatter(x, y, **kwds)
Plot showing top 4 gene expression patterns per cell category. For example, cluser 6 and 2 show similar expression patterns for gene MALAT 1 Cluster 7 and Cluster 6 show closely related expression patterns for the differentially expressed genes. There is evidence for co-expression patterns in genes such as IGFBP7 and FI27 in cluster 11. The MALAT1 gene has overexpression patterns in cell types 2 and 6
Heat Map for differentially exprssed genes per cell type category¶
sc.tl.rank_genes_groups(data, "clusters", method="t-test")
sc.pl.rank_genes_groups_heatmap(data, groups="5", n_genes=25, groupby="clusters")
ranking genes
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:17)
WARNING: dendrogram data not found (using key=dendrogram_clusters). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_clusters']`
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: 5
S1006A stands out as the most overexpressed gene in most of the cell types. Clusers 4 and 11 show the highest differentially expressed genes
Pathway Analysis¶
Let’s extract the ranks for genes differentially expressed in cluster 0. We use these ranks and the gene sets from REACTOME to find gene sets enriched in this cell population compared to all other populations using GSEA as implemented in decoupler.
data2= sc.read_10x_h5("../data/merged.h5")
data2
reading ../data/output.h5 (0:00:02)
AnnData object with n_obs × n_vars = 4267 × 33538
var: 'gene_ids', 'feature_types', 'genome'
data2.obs
| AAACCCAAGTCAGGGT-1 |
|---|
| AAACCCACATCACAGT-1 |
| AAACCCATCTCAATCT-1 |
| AAACGAACAAGTTCGT-1 |
| AAACGAACACAACGAG-1 |
| ... |
| TTTGGTTGTAGTTCCA-1 |
| TTTGTTGCAGGCCTGT-1 |
| TTTGTTGGTAATGATG-1 |
| TTTGTTGTCCTACAAG-1 |
| TTTGTTGTCGCTGCGA-1 |
4267 rows × 0 columns
We will define a function to get differentially expressed genes in each cell type vs the rest. We will then call the function and pass in the cluster of the cell category to obtain the sorted ranks of genes in order of differential expression
# Storing the counts for later use
data2.layers["counts"] = data2.X.copy()
# Renaming label to condition
data2.obs = data2.obs.rename({"label": "condition"}, axis=1)
# Normalizing
sc.pp.normalize_total(data2)
sc.pp.log1p(data2)
normalizing counts per cell
finished (0:00:00)
WARNING: adata.X seems to be already log-transformed.
top_4000Genes=sc.pp.highly_variable_genes(data2, flavor="seurat", n_top_genes=4000)
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
finished (0:00:01)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
sc.pp.pca(data2)
sc.pp.neighbors(data2)
sc.tl.umap(data2)
computing PCA
on highly variable genes
with n_comps=50
finished (0:00:16)
computing neighbors
using 'X_pca' with n_pcs = 50
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:01)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:18)
adata.obs["group"] = data2.obs.("string") + "_" + adata.obs.cell_type
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) /tmp/ipykernel_19840/2106892138.py in ?() ----> 1 data2.obs["group"] = data2.obs.condition.astype("string") + "_" + data2.obs.cell_type /usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in ?(self, name) 6200 and name not in self._accessors 6201 and self._info_axis._can_hold_identifiers_and_holds_name(name) 6202 ): 6203 return self[name] -> 6204 return object.__getattribute__(self, name) AttributeError: 'DataFrame' object has no attribute 'condition'
def get_scores(cluster):
# Perform differential expression testing with the desired method ('t-test')
scores = pd.DataFrame(data.uns['rank_genes_groups']['scores'])[cluster]
gene_names = pd.DataFrame(data.uns['rank_genes_groups']['names'])[cluster]
# Create a DataFrame with gene names in the first column and scores in the second column
top_markers = pd.DataFrame({'Gene Names': gene_names, 'Scores': scores})
# sort the DataFrame by absolute scores
markers = top_markers.iloc[np.abs(top_markers['Scores']).argsort()[::-1]]
markers.set_index('Gene Names', inplace=True)
return markers
Gene ranks for cluster 0¶
cluster_0=get_scores("0").head(1500)
cluster_0
| Scores | |
|---|---|
| Gene Names | |
| CTSL | 42.473652 |
| GSTO1 | 36.598206 |
| TNFAIP6 | 34.366024 |
| PRRX1 | 34.347439 |
| MEDAG | 32.788410 |
| ... | ... |
| AC126175.1 | -8.808897 |
| NDFIP1 | 8.804882 |
| ZNF121 | 8.803287 |
| FAT1 | 8.801478 |
| CREB3L2 | 8.800745 |
1500 rows × 1 columns
Now we will use the python package decoupler [Badia-i-Mompel et al., 2022] to perform GSEA enrichment tests on our data.
Retrieve gene sets¶
Download and read the gmt file for the REACTOME pathways annotated in the C2 collection of MSigDB.
# # Downloading reactome pathways
# from pathlib import Path
# if not Path("c2.cp.reactome.v7.5.1.symbols.gmt").is_file():
# !wget -O 'c2.cp.reactome.v7.5.1.symbols.gmt' https://figshare.com/ndownloader/files/35233771
def gmt_to_decoupler(pth: Path) -> pd.DataFrame:
"""
Parse a gmt file to a decoupler pathway dataframe.
"""
from itertools import chain, repeat
pathways = {}
with Path(pth).open("r") as f:
for line in f:
name, _, *genes = line.strip().split("\t")
pathways[name] = genes
return pd.DataFrame.from_records(
chain.from_iterable(zip(repeat(k), v) for k, v in pathways.items()),
columns=["geneset", "genesymbol"],
)
reactome = gmt_to_decoupler("c2.cp.reactome.v7.5.1.symbols.gmt")
reactome
# Filter duplicates
reactome = reactome[~reactome.duplicated(("geneset", "genesymbol"))]
reactome
| geneset | genesymbol | |
|---|---|---|
| 0 | REACTOME_INTERLEUKIN_6_SIGNALING | JAK2 |
| 1 | REACTOME_INTERLEUKIN_6_SIGNALING | TYK2 |
| 2 | REACTOME_INTERLEUKIN_6_SIGNALING | CBL |
| 3 | REACTOME_INTERLEUKIN_6_SIGNALING | STAT1 |
| 4 | REACTOME_INTERLEUKIN_6_SIGNALING | IL6ST |
| ... | ... | ... |
| 89471 | REACTOME_ION_CHANNEL_TRANSPORT | FXYD7 |
| 89472 | REACTOME_ION_CHANNEL_TRANSPORT | UBA52 |
| 89473 | REACTOME_ION_CHANNEL_TRANSPORT | ATP6V1E2 |
| 89474 | REACTOME_ION_CHANNEL_TRANSPORT | ASIC5 |
| 89475 | REACTOME_ION_CHANNEL_TRANSPORT | FXYD1 |
89476 rows × 2 columns
Running GSEA¶
We will read the gene sets first. Unlike packages like fgsea that automatically filter gene sets based on maximum size, decoupler, by default, does not perform such filtering. Instead, we will manually screen gene sets, ensuring they contain a minimum of 15 genes and a maximum of 500 genes.
# Filtering genesets to match behaviour of fgsea
geneset_size = reactome.groupby("geneset").size()
gsea_genesets = geneset_size.index[(geneset_size > 15) & (geneset_size < 500)]
We’ll use the t-statistics from the t-test to rank the genes for the cluster 0 and computes p-values for each of the pathways.
def pathway_enrichment(cluster):
# Extract the scores of the differentially expressed genes
cluster_df=get_scores(cluster).head(4000)
# print(cluster_df)
# Run the GSEA using the reactom database
scores, norm, pvals = decoupler.run_gsea(
cluster_df.T,
reactome[reactome["geneset"].isin(gsea_genesets)],
source="geneset",
target="genesymbol",)
# Get t-stat
gsea_results = pd.concat({"score": scores.T, "norm": norm.T, "pval": pvals.T}, axis=1).droplevel(level=1, axis=1).sort_values("pval")
print(gsea_results)
return gsea_results
%%time
cluster_0=pathway_enrichment("0")
cluster_0
Scores
Gene Names
CTSL 42.473652
GSTO1 36.598206
TNFAIP6 34.366024
PRRX1 34.347439
MEDAG 32.788410
... ...
LINC00460 4.720742
PRH1 4.720646
DNER 4.719110
SLC39A6 4.718694
SSPN 4.718112
[4000 rows x 1 columns]
score norm \
source
REACTOME_ABC_FAMILY_PROTEINS_MEDIATED_TRANSPORT 0.585699 2.674584
REACTOME_REGULATION_OF_PTEN_STABILITY_AND_ACTIVITY 0.580498 2.621807
REACTOME_REGULATION_OF_MRNA_STABILITY_BY_PROTEI... 0.542736 2.544917
REACTOME_REGULATION_OF_HMOX1_EXPRESSION_AND_ACT... 0.601533 2.734210
REACTOME_REGULATION_OF_EXPRESSION_OF_SLITS_AND_... 0.671769 3.611585
... ... ...
REACTOME_TRIGLYCERIDE_CATABOLISM 0.212516 0.498162
REACTOME_REPRODUCTION 0.137751 0.482925
REACTOME_N_GLYCAN_ANTENNAE_ELONGATION_IN_THE_ME... 0.215917 0.493533
REACTOME_MEIOSIS 0.121586 0.404092
REACTOME_TNF_RECEPTOR_SUPERFAMILY_TNFSF_MEMBERS... 0.157981 0.412662
pval
source
REACTOME_ABC_FAMILY_PROTEINS_MEDIATED_TRANSPORT 0.000000
REACTOME_REGULATION_OF_PTEN_STABILITY_AND_ACTIVITY 0.000000
REACTOME_REGULATION_OF_MRNA_STABILITY_BY_PROTEI... 0.000000
REACTOME_REGULATION_OF_HMOX1_EXPRESSION_AND_ACT... 0.000000
REACTOME_REGULATION_OF_EXPRESSION_OF_SLITS_AND_... 0.000000
... ...
REACTOME_TRIGLYCERIDE_CATABOLISM 0.984871
REACTOME_REPRODUCTION 0.985770
REACTOME_N_GLYCAN_ANTENNAE_ELONGATION_IN_THE_ME... 0.987915
REACTOME_MEIOSIS 0.993455
REACTOME_TNF_RECEPTOR_SUPERFAMILY_TNFSF_MEMBERS... 1.000000
[833 rows x 3 columns]
CPU times: user 4min 58s, sys: 455 ms, total: 4min 58s
Wall time: 42.1 s
| score | norm | pval | |
|---|---|---|---|
| source | |||
| REACTOME_ABC_FAMILY_PROTEINS_MEDIATED_TRANSPORT | 0.585699 | 2.674584 | 0.000000 |
| REACTOME_REGULATION_OF_PTEN_STABILITY_AND_ACTIVITY | 0.580498 | 2.621807 | 0.000000 |
| REACTOME_REGULATION_OF_MRNA_STABILITY_BY_PROTEINS_THAT_BIND_AU_RICH_ELEMENTS | 0.542736 | 2.544917 | 0.000000 |
| REACTOME_REGULATION_OF_HMOX1_EXPRESSION_AND_ACTIVITY | 0.601533 | 2.734210 | 0.000000 |
| REACTOME_REGULATION_OF_EXPRESSION_OF_SLITS_AND_ROBOS | 0.671769 | 3.611585 | 0.000000 |
| ... | ... | ... | ... |
| REACTOME_TRIGLYCERIDE_CATABOLISM | 0.212516 | 0.498162 | 0.984871 |
| REACTOME_REPRODUCTION | 0.137751 | 0.482925 | 0.985770 |
| REACTOME_N_GLYCAN_ANTENNAE_ELONGATION_IN_THE_MEDIAL_TRANS_GOLGI | 0.215917 | 0.493533 | 0.987915 |
| REACTOME_MEIOSIS | 0.121586 | 0.404092 | 0.993455 |
| REACTOME_TNF_RECEPTOR_SUPERFAMILY_TNFSF_MEMBERS_MEDIATING_NON_CANONICAL_NF_KB_PATHWAY | 0.157981 | 0.412662 | 1.000000 |
833 rows × 3 columns
(
so.Plot(
data=(
cluster_0.head(20).assign(
**{"-log10(pval)": lambda x: -np.log10(x["pval"])}
)
),
x="-log10(pval)",
y="source",
).add(so.Bar())
)
cluster_0.head(50).pval
source REACTOME_ABC_FAMILY_PROTEINS_MEDIATED_TRANSPORT 0.0 REACTOME_REGULATION_OF_PTEN_STABILITY_AND_ACTIVITY 0.0 REACTOME_REGULATION_OF_MRNA_STABILITY_BY_PROTEINS_THAT_BIND_AU_RICH_ELEMENTS 0.0 REACTOME_REGULATION_OF_HMOX1_EXPRESSION_AND_ACTIVITY 0.0 REACTOME_REGULATION_OF_EXPRESSION_OF_SLITS_AND_ROBOS 0.0 REACTOME_DISEASES_OF_SIGNAL_TRANSDUCTION_BY_GROWTH_FACTOR_RECEPTORS_AND_SECOND_MESSENGERS 0.0 REACTOME_DISORDERS_OF_TRANSMEMBRANE_TRANSPORTERS 0.0 REACTOME_PTEN_REGULATION 0.0 REACTOME_PROTEIN_UBIQUITINATION 0.0 REACTOME_DNA_REPAIR 0.0 REACTOME_DNA_REPLICATION 0.0 REACTOME_DNA_REPLICATION_PRE_INITIATION 0.0 REACTOME_PROTEIN_LOCALIZATION 0.0 REACTOME_PROTEIN_FOLDING 0.0 REACTOME_DOWNSTREAM_SIGNALING_EVENTS_OF_B_CELL_RECEPTOR_BCR 0.0 REACTOME_PROGRAMMED_CELL_DEATH 0.0 REACTOME_PROCESSING_OF_DNA_DOUBLE_STRAND_BREAK_ENDS 0.0 REACTOME_PROCESSING_OF_CAPPED_INTRON_CONTAINING_PRE_MRNA 0.0 REACTOME_PINK1_PRKN_MEDIATED_MITOPHAGY 0.0 REACTOME_PD_1_SIGNALING 0.0 REACTOME_REGULATION_OF_RAS_BY_GAPS 0.0 REACTOME_REGULATION_OF_RUNX2_EXPRESSION_AND_ACTIVITY 0.0 REACTOME_DEUBIQUITINATION 0.0 REACTOME_REGULATION_OF_RUNX3_EXPRESSION_AND_ACTIVITY 0.0 REACTOME_SELENOAMINO_ACID_METABOLISM 0.0 REACTOME_CLEC7A_DECTIN_1_SIGNALING 0.0 REACTOME_SELECTIVE_AUTOPHAGY 0.0 REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21 0.0 REACTOME_RUNX1_REGULATES_TRANSCRIPTION_OF_GENES_INVOLVED_IN_DIFFERENTIATION_OF_HSCS 0.0 REACTOME_RRNA_PROCESSING 0.0 REACTOME_RRNA_MODIFICATION_IN_THE_NUCLEUS_AND_CYTOSOL 0.0 REACTOME_COOPERATION_OF_PREFOLDIN_AND_TRIC_CCT_IN_ACTIN_AND_TUBULIN_FOLDING 0.0 REACTOME_CROSS_PRESENTATION_OF_SOLUBLE_EXOGENOUS_ANTIGENS_ENDOSOMES 0.0 REACTOME_PCP_CE_PATHWAY 0.0 REACTOME_CYCLIN_A_CDK2_ASSOCIATED_EVENTS_AT_S_PHASE_ENTRY 0.0 REACTOME_C_TYPE_LECTIN_RECEPTORS_CLRS 0.0 REACTOME_RESPONSE_OF_EIF2AK4_GCN2_TO_AMINO_ACID_DEFICIENCY 0.0 REACTOME_DECTIN_1_MEDIATED_NONCANONICAL_NF_KB_SIGNALING 0.0 REACTOME_RESPIRATORY_ELECTRON_TRANSPORT_ATP_SYNTHESIS_BY_CHEMIOSMOTIC_COUPLING_AND_HEAT_PRODUCTION_BY_UNCOUPLING_PROTEINS 0.0 REACTOME_DEFECTIVE_CFTR_CAUSES_CYSTIC_FIBROSIS 0.0 REACTOME_DEGRADATION_OF_AXIN 0.0 REACTOME_DEGRADATION_OF_DVL 0.0 REACTOME_DEGRADATION_OF_GLI1_BY_THE_PROTEASOME 0.0 REACTOME_RESPIRATORY_ELECTRON_TRANSPORT 0.0 REACTOME_CYTOPROTECTION_BY_HMOX1 0.0 REACTOME_ORGANELLE_BIOGENESIS_AND_MAINTENANCE 0.0 REACTOME_ORC1_REMOVAL_FROM_CHROMATIN 0.0 REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION 0.0 REACTOME_GLOBAL_GENOME_NUCLEOTIDE_EXCISION_REPAIR_GG_NER 0.0 REACTOME_MITOCHONDRIAL_PROTEIN_IMPORT 0.0 Name: pval, dtype: float32
We make a bar plot of top 20 pathways significantly enriched in stimulated cluster 0 cells
(
so.Plot(
data=(
gsea_results.head(20).assign(
**{"-log10(pval)": lambda x: -np.log10(x["pval"])}
)
),
x="-log10(pval)",
y="source",
).add(so.Bar())
)
# Cluster 1
get_t_statistics("1")
# Cluster 1
get_t_statistics("2")
# Cluster 1
get_t_statistics("3")
In the graph, the y-axis represents the names of pathways, while the x-axis illustrates the adjusted p-values. The greater the height of the bar, the more pronounced the significance of the pathway. The pathways are arranged in order of their level of significance.
reactome=reactome["genesymbol"].to_list()
reactome.index("CTSL")
3446
Results and Discussion¶
- Cathepsin L (CTSL), is a lysosomal acid cysteine protease that has been reported to play a critical role in chemosensitivity and tumor progression. I has been found that CTSL served as a prognostic marker for poor clinical outcomes in Neuroblastoma patients [4] and breat cancer [5].
- AREG has ben suggested as a valuable prognostic biomarker in invasive breast cancer and promising therapeutic targets, especially in ER-negative breast cancer [6]
- The study suggests co-expresion patterns of KRT8 and KRT 18 genes 8, 5 and 9. Previous studies have shown the co-expression of these genes in cancer breat patients [7].
- The expression of the HLA-DQA1 significantly upregulated in breast cancer a sin line with the previous findings [8].
- IGFBP7 was reported to play tumor suppressive role in breast cancer[9]
Conclusion¶
- This project has shown the co-expression pattersn for KRT8 and KRT 18 genes genes
- Most cancer associated genes are CTSL,AREG and DCN
- Kim H, Whitman AA, Wisniewska K, Kakati RT, Garcia-Recio S, Calhoun BC, Franco HL, Perou CM, Spanheimer PM. Tamoxifen Response at Single Cell Resolution in Estrogen Receptor-Positive Primary Human Breast Tumors. bioRxiv [Preprint]. 2023 Apr 19:2023.04.01.535159. doi: 10.1101/2023.04.01.535159. Update in: Clin Cancer Res. 2023 Sep 25;: PMID: 37066379; PMCID: PMC10103953.
- Wei Z, Shen Y, Zhou C, Cao Y, Deng H, Shen Z. CD3D: a prognostic biomarker associated with immune infiltration and immunotherapeutic response in head and neck squamous cell carcinoma. Bioengineered. 2022 May;13(5):13784-13800. doi: 10.1080/21655979.2022.2084254. PMID: 35712757; PMCID: PMC9276048.
- Lu, J., Ahmad, R., Nguyen, T. et al. Heterogeneity and transcriptome changes of human CD8+ T cells across nine decades of life. Nat Commun 13, 5128 (2022). https://doi.org/10.1038/s41467-022-32869-x
- Du, X., Ding, L., Huang, S., Li, F., Yan, Y., Tang, R., ... & Wang, W. (2022). Cathepsin L promotes chemresistance to neuroblastoma by modulating serglycin. Frontiers in Pharmacology, 13, 920022.
- Zhang, L., Zhao, Y., Yang, J., Zhu, Y., Li, T., Liu, X., ... & Fu, J. (2023). CTSL, a prognostic marker of breast cancer, that promotes proliferation, migration, and invasion in cells in triple-negative breast cancer. Frontiers in Oncology, 13, 1158087.
- Xiang, G., Liu, F., Liu, J., Meng, Q., Li, N., & Niu, Y. (2019). Prognostic role of Amphiregulin and the correlation with androgen receptor in invasive breast cancer. Pathology-Research and Practice, 215(6), 152414.
- Wilson, C. A., & Dering, J. (2004). Recent translational research: microarray expression profiling of breast cancer–beyond classification and prognostic markers?. Breast Cancer Research, 6, 1-9.
- Wu, G., Xiao, G., Yan, Y., Guo, C., Hu, N., & Shen, S. (2022). Bioinformatics analysis of the clinical significance of HLA class II in breast cancer. Medicine, 101(40).
- Li, D., Xia, L., Huang, P., Wang, Z., Guo, Q., Huang, C., ... & Qin, S. (2023). Cancer-associated fibroblast-secreted IGFBP7 promotes gastric cancer by enhancing tumor associated macrophage infiltration via FGF2/FGFR1/PI3K/AKT axis. Cell Death Discovery, 9(1), 17.
- https://www.sc-best-practices.org/introduction/analysis_tools.html
- https://www.sc-best-practices.org/conditions/gsea_pathway.html